Introduction

This PDF serves as documentation for the 1.Data_cleaning_and_ward_matching.R R script, a code written to ascertain the location of a large number of schools over the Zambian territory by identifying which Ward and District each school observation falls into. Such a process is needed given three main challenges faced during the data collection of school information in Zambia:

  • An important portion of our school dataset is without GPS information
  • The GPS coordinates provided for each school observation are not always accurate, hence may be misleading in our efforts to situate schools in their respective wards and districts.
  • School observations have been collected over time (between 2006 and 2017), and although their locations do not change, the wards and districts in which they fall do, given changes in administrative borders and names.

With the issues listed above, there is a need to harmonize our data, hence to conduct a matching process with reference to a specific base map, which in our case is the population_census_2010 shapefile. Our input dataset with regards to Zambia schools is named school_electricity.dta.

Input Files

Our code uses three input files:

  • school_electricity.dta : A dataset of all schools spanning the Zambian territory, collected overtime at a school-year level and with location information including gps, ward and district.
  • Zambia_Wards-2010-Census : A shapefile representing the Map of Zambia as of 2010, with administrative boundaries detailed down to the WARD level.
  • population_census_2010 : A shapefile representing the Map of Zambia as of 2010, with administrative boundaries detailed down to the DISTRICT level.

Structure of the code

The code is organized as such:

  • TASK 1 - Data Cleaning: recover ward info & identify schools with unique wards
  • TASK 2 - Ward Matching & Bridging
    1. Adjusting wards map to 2010 district census map
    2. Adjusting school data to 2010 district census map
    3. Matching exercise: mapping schools against wards map (spatial join)
    • Part 1: spatial join
    • Part 2: matching
      • case 1: same districts, same wards
      • case 2: same districts, similar wards
      • case 3: same districts, different wards
      • case 4: different districts, similar wards
      • case 5: different districts, same wards
      • case 6: different districts, different wards
    1. Bridging exercise: creating a bridge for observations with no gps informaion
    2. Results Summary

Getting started

The packages required for running the code are the following:

# 1.0 For using data.table syntax
require(haven) # read dta file
require(data.table)

# 1.1 Basic packages necessary for manipulating maps/simple features
library("sf")
library("sp")
library("dplyr") # For relocate() feature

# 1.2  For loading geometries with OGR
library(rgdal)  

# 1.3 For loading .geojson files
library(geojsonsf)
library(jsonlite)

# 1.4 For comparing words
require(RecordLinkage)

# PLOTTING MAPS:
library("ggplot2")
library("ggspatial")
library("RColorBrewer")
library("colorRamps") 

Description of the code content

TASK 1: Data Cleaning: Recover Ward Info & Identify Schools with Unique Wards

The first part of our code consists in cleaning our initial school dataset , school_electricity.dta, by dropping a number of unnecessary columns, dropping observations with no school name, recovering some Ward information for a number of observations for which the ward information is existing for a limited number of years, and finally, identifying schools whose Ward information change over the years. Such information has been recorded in a new column named unique_ward.

Example of school with ward information is existing for a limited number of years

The output dataset of this cleaning process was then saved as zambia_schools.csv.

schools <- read_dta("school_electricity.dta")
#convert existing objects to a data.table using setDT() (for data.frames and lists)
schools_dt <- setDT(schools)
# Check Total number of observations before cleaning:
schools_dt[, .N]
## [1] 134601


# DATA CLEANING...
# Check Total number of observations after cleaning:
schools_relevant[, .N]
## [1] 129743


# ...
# SAVE OUTPUT
fwrite(schools_relevant, "zambia_schools.csv", sep=",", na="", row.names=FALSE, col.names=TRUE)

TASK 2: Ward Matching & Bridging

The matching exercise is to be achieved by running a spatial join between our schools dataset, now zambia_schools.csv, and Zambia_Wards-2010-Census, a 2010 Zambia map with administrative boundaries subdivided down to the WARD level. However, given that both datasets have been collected at different points in time, there is a need to revert them back to a single reference point, which is in this case the population_census_2010, a 2010 Zambia map with administrative boundaries detailed down to the DISTRICT level. This Matching & Bridging task ,which constitutes the main part of the code, is therefore achieved in four main steps, namely adjusting our Zambia wards map to 2010 district census map, adjusting the school dataset to the 2010 district census map, and finally proceeding to the Matching and Bridging exercises.

1. Adjusting wards map to 2010 district census map

Two inputs are used for this section: The Zambia Wards map, Zambia_Wards-2010-Census & the Zambia Districts reference map, population_census_2010, and the aim of this adjustment procedure is to allocate each ward in the Zambia Wards map to the correct district with reference to the 2010 Zambia Districts map. The adjustment is achieved in 5 steps.

  • Step 1 consists in preparing the Zambia_Wards-2010-Census wards map for manual adjustment in QGIS. This is achieved by running a spatial join between both datasets, subsetting wards falling into multiple district from the reference map listing their OBJECTIDs and extracting those from the Zambia Wards Map for manual District allocation in QGIS. The resulting file is saved as mismatches.geojson.

  • Step 2 consists in performing a district allocation in QGIS, by manually checking the mismatched Wards against the Districts reference map to allocate wards to districts accordingly. The output file is then saved as mismatches_fixed.geojson.

  • Step 3 consists in loading back the harmonized - and previously mismatched - wards to R, making a list of those wards’ harmonized names and OBJECTIDs, and accordingly updating the original Zambia wards map by adding a new column, district_harm, indicating the harmonized district for each ward, including the wards that had fallen in the correct district - according to what their district information indicated prior to the spatial join. The resulting map was saved as Fixed_names_and_geometries_Zambia_Wards_2014t.prj.geojson.

  • Step 4 consisted in manually correcting for ward overlap issues in the original Zambia Wards map, using QGIS’s Vertex Tool. The final output was saved as Fixed_names_and_geometries_Zambia_Wards_2014t.prj_fixedgeom.geojson.

  • Step 5 is an assessment of the previous steps. It consists in:

    • First, producing a list of all districts in both the adjusted Zambia Wards map and the reference map to assess how these compare. [List of distrcts in Zambia map Vs District in reference map](html_docs/img/district_compare.jpg

    • Secondly, producing a three maps and assess how they compare. These are, the Zambia Districts reference map and the Zambia Wards map prior to and after adjustment.

a. Zambia Districts map

b. Zambia Wards map prior to and after adjustment.

2. Adjusting school data to 2010 district census map

This section aims at editing the districts variable of our zambia_schools.csv dataset - which has been collected over several years - to reflect the 2010 District structure as presented by the reference map, population_census_2010. A ‘brige’ provided by Prof. Torsten is therefore used:

schools<- fread("zambia_schools.csv")

#...
schools[, district_ipums := "NA"] #Set new column for 2010 districts info

schools[district_emis=="Sikongo", district_ipums := "Kalabo"]
schools[district_emis=="Mitete", district_ipums := "Lukulu"]
schools[district_emis=="Manyinga", district_ipums := "Kabompo"]
schools[district_emis=="Limulunga", district_ipums := "Mongu"]
schools[district_emis=="Nalolo", district_ipums := "Senanga"]
schools[district_emis=="Sioma", district_ipums := "Shang'ombo"]
schools[district_emis=="Mwandi", district_ipums := "Sesheke"]
schools[district_emis=="Mulobezi", district_ipums := "Sesheke"]
schools[district_emis=="Nkeyema", district_ipums := "Kaoma"]
schools[district_emis=="Luampa", district_ipums := "Kaoma"]
schools[district_emis=="Pemba", district_ipums := "Choma"]
schools[district_emis=="Chirundu", district_ipums := "Siavonga"]
schools[district_emis=="Chikankata", district_ipums := "Mazabuka"]
schools[district_emis=="Chisamba", district_ipums := "Chibombo"]
schools[district_emis=="Luano", district_ipums := "Mkushi"]
schools[district_emis=="Ngabwe", district_ipums := "Kapiri Mposhi"]
schools[district_emis=="Mwembeshi", district_ipums := "Mumbwa"]
schools[district_emis=="Chilanga", district_ipums := "Kafue"]
schools[district_emis=="Rufunsa", district_ipums := "Chongwe"]
schools[district_emis=="Sinda", district_ipums := "Katete"]
schools[district_emis=="Vubwi", district_ipums := "Chadiza"]
schools[district_emis=="Lunga", district_ipums := "Samfya"]
schools[district_emis=="Chembe", district_ipums := "Mansa"]
schools[district_emis=="Chipili", district_ipums := "Mwense"]
schools[district_emis=="Shiwangandu", district_ipums := "Chinsali"]
schools[district_emis=="Mwansabombwe", district_ipums := "Kawambwa"]
schools[district_emis=="Nsama", district_ipums := "Kaputa"]
schools[district_emis=="Chiengi", district_ipums := "Chienge"]
schools[district_emis=="Chitambo", district_ipums := "Serenje"]
schools[district_emis=="Shangombo", district_ipums := "Shang'ombo"]
schools[district_emis=="Shibuyunji", district_ipums := "Mumbwa"]
schools[district_emis=="Zimba", district_ipums := "Kalomo"]
schools[district_emis=="Mufumbwe", district_ipums := "Mufumbwe (Chizera)"]

The resulting output is ready for the spatial join and saved as zambia_schools_adjusted.csv.

3. Matching Exercise: mapping schools against Zambia Wards map (spatial join)

The following exercise is at the core of this undertaking and aims at ascertaining the location of Zambian schools in the dataset by confirming whether or not each observation falls in the ward and district indicated in the ward and district_emiscolumns respectively. Our input files are Fixed_names_and_geometries_Zambia_Wards_2014t.prj_fixedgeom.geojson for the Zambia Wards map and zambia_schools_adjusted.csv for our schools dataset. The matching exercise is achieved in two parts, the first one essentially being the spatial join and the second, a series of edits & tests needed for ascertaining the location of each school.

Part 1: Spatial Join

Part 1 starts with loading both input files, subsetting the portion of the schools dataset with GPS coordinates, and running at spatial join using st_intersect.

schools<- fread("zambia_schools_adjusted.csv")
# total Number of observations in the schools dataset:
schools[, .N]
## [1] 129743

# With coordinates:
schools[is.na(lat)==FALSE, .N]
## [1] 77648

# Without:
schools[!is.na(lat)==FALSE, .N]
## [1] 52095

After the special join, our main data table names spatial_join is created. We then rename a few variables for better clarity and create the first similarity scores - wards_similar & districts_similar_harm - which evaluate the Levenshtein distance, as a means to assess whether wards names and district names are the same in both the Zambia map and our schools data for each observation. The first tests are run subsequently:

# total Number of observations in the spatial_join data table:
spatial_join[,.N]
## [1] 77647

# Number of observations falling in a district with the exact same name it indicated prior to the spatial join:
spatial_join[districts_similar_harm== 1, .N]
## [1] 70712

# Number of observations falling in a district with a name similar to the one it indicated prior to the spatial join (mismatches due to typos)
spatial_join[between(districts_similar_harm, 0.6, 0.99)==TRUE, .N]
## [1] 0

Part 2: matching

This second part of our matching exercise consists in running a series of tests & edits in order to ascertain the location of as many observations as possible from the 77,647 observations (corresponding to 7130 schools recorded over several years) in the schools dataset. This undertaking is itself subdivided and achieved in a series of six steps which are:

  • case 1: same districts, same wards
  • case 2: same districts, similar wards
  • case 3: same districts, different wards
  • case 4: different districts, similar wards
  • case 5: different districts, same wards
  • case 6: different districts, different wards
Case 1: same districts, same wards

In this first step we identify the observations falling in a district and ward with the exact same names it indicated prior to the spatial join. These are our perfect matches.

# Number of obs with same district & wards names:
spatial_join[districts_similar_harm== 1 & wards_similar== 1, .N]
## [1] 35097

We infer from the test above that 35,097 observations out of a total of 77,647 are perfect matches. This represents 47.8% and no subsequent edit is required.

Case 2: same districts, similar wards

The second case deals with observations that fall in the district indicated on the schools dataset, but in wards that have slightly different names on the map. We derive this group using one of our similarity scores, namely wards_similar, deducing that strings with a Levenshtein distance ranging between 0.6 and 0.99 are close enough to be referring to the same ward in the case of Zambian naming patterns. Of course, with exceptions. Hence:

# number of same district names & similar wards names:
spatial_join[districts_similar_harm== 1 & between(wards_similar, 0.6, 0.99)==TRUE, .N]
## [1] 6787

We assume that these slight differences are typo-related and the adjustement applied to this group of observations consists in creating a new variable, ward_harm that takes on the value of the harmonized name chosen for each particular observation. In this specific case, the harmonized value is the name of the given ward as spelled by the map. Hence or series of adjustment look like the following:

spatial_join[mismatch_ward == "yes" & ward_school=="chilongolo" & ward_census == "chilongolo", ward_harm := "chilongolo"]
spatial_join[mismatch_ward == "yes" & ward_school=="imatongo" & ward_census == "imatanda", ward_harm := "imatanda"]
spatial_join[mismatch_ward == "yes" & ward_school=="ilambo" & ward_census == "ilombe", ward_harm := "ilombe"]
spatial_join[mismatch_ward == "yes" & ward_school=="nkanga" & ward_census == "nsenga", ward_harm := "nsenga"]
spatial_join[mismatch_ward == "yes" & ward_school=="luche" & ward_census == "nsenga", ward_harm := "nsenga"]

#.... 

spatial_join[mismatch_ward == "yes" & ward_school=="mazabuka" & ward_census == "mazabuka central", ward_harm := "mazabuka central"]
spatial_join[mismatch_ward == "yes" & ward_school=="kaluweza/ngabo" & ward_census == "ngabo", ward_harm := "ngabo"]
spatial_join[mismatch_ward == "yes" & ward_school=="chambi-mandalo" & ward_census == "chambi", ward_harm := "chambi"]
spatial_join[mismatch_ward == "yes" & ward_school=="mateyo" & ward_census == "mateyo mzeka", ward_harm := "mateyo mzeka"]

After the adjustments, we create a new similarity score wards_similar_harm, this time comparing the harmonized ward name - ward_harm - to the ward name from the Zambia map (in which the observation falls). It is worth noting here that ward_harm takes on the value of the school dataset’s ward - which is the same as the Zambia map’s - for all observations falling in the Case 1. We then re-run tests:

# number of obs with same district names & similar wards names:
spatial_join[districts_similar_harm== 1 & between(wards_similar, 0.6, 0.99)==TRUE, .N]
## [1] 6787

# number of obs with same district names & similar harmonized wards names:
spatial_join[districts_similar_harm== 1 & between(wards_similar_harm, 0.6, 0.99)==TRUE, .N]
## [1] 0

# Number of obs with same district names & names wards names:
spatial_join[districts_similar_harm== 1 & wards_similar== 1, .N]
## [1] 35097

# number of obs with same district names & same harmonized wards names:
spatial_join[districts_similar_harm== 1 & wards_similar_harm== 1, .N]
## [1] 41982

Our reference from now on is the wards_similar_harm variable given that we care about the progress in our ward_harm variable, which is the the main output of this matching exercise.

Case 3: same districts, different wards

After solving for typo issues, this third case deals particularly with observations that do fall in the district indicated in the schools dataset but in a different ward. For these observations, their ward_harm variable still takes on an ‘NA’ value, hence they can be deduced using to test queries:

# Number of obs with same district names & different wards names: 
spatial_join[districts_similar_harm== 1 & wards_similar_harm !=1, .N]
## [1] 28730

# Number of obs with harmonized ward names that still take on an "NA" value:
spatial_join[districts_similar_harm== 1 & ward_harm == "NA", .N]
## [1] 28730

Note that those observations amount to 28,730 versus 41,982 for observations with perfect match in harmonized value (28730 + 41982 = 70712, the total number of observations falling in the exact same disctrict indicated).

The matching procedure for this case consists in assessing whether for each of these observations, the GPS coordinate given is less that 15k away from the indicated ward. In this case , we choose our harmonized value to be the ward indicated by the school dataset (i.e indicated by “the school headmaster” filling the survey). In solving this case in such a way we are assuming that the GPS coordinates are slightly wrong, hence the school seems to be falling in a different ward although it remains in the same district. Our code hence involves creating a 15 km buffer around the designated schools, tying each school to wards within a 15km distance using st_intersect, computing a similarity score between the schools’ indicated ward name and the names of each ward within that 15km distance, and finally creating a variable that indicates the highest name similarity within 15km for each school. For the school observations that passed the similarity test, we concluded that the ward “indicated by the school headmaster” was indeed the correct one, and updated the ward_harm value accordingly. It is worth noting that this process was repeated twice - COMPUTATIONAL SOLUTION - Part 1 & COMPUTATIONAL SOLUTION - Part 2 - to account for schools whose ward indicated ward name changed over time due to national-level revisions in adminitrative unit’s - wards in this case - names.

After the adjustment made, we re-generated the wards_similar_harm similarity score and re-ran the initial tests :

# Number of obs with same district names & different wards names: 
spatial_join[districts_similar_harm== 1 & wards_similar_harm !=1, .N]
## [1] 28352


# Number of obs with harmonized ward names that still take on an "NA" value:
spatial_join[districts_similar_harm== 1 & ward_harm == "NA", .N]
## [1] 8988

It is worth noting here that while the number of observations with same district names and different wards names haven’t decreased [by very much needs work -> shouldn’t decrease at all if case 2 is effective], the number of observations with harmonized ward names taking on an “NA” value decreased significantly from 28730 down to 8988 [needs work -> for the same reason]. Hence this step allowed us to harmonize 28730 - 8988 = 19742[needs work -> for the same reason] observations using the 15km range approach.

COMPUTATIONAL SOLUTION - Part 1 & COMPUTATIONAL SOLUTION - Part 2 completed, there still remains 8988 observations which fall in the same district but in a different ward, with the ward “indicated by the school headmaster” either non-existant or beyond the 15km distance. These cases are harmonized in COMPUTATIONAL SOLUTION - Part 3 by choosing the GPS location as our reference, hence updating the ward_harm variable to reflect the ward name as indicated by the Zambia map. Our final test for this section are the following:

# Number of obs with harmonized ward names that still take on an "NA" value:
spatial_join[districts_similar_harm== 1 & ward_harm == "NA", .N]
## [1] 0
Case 4: different districts, similar wards

The first 3 cases above pertained to the group of observations (70712 out of 77647) falling into the exact same district “indicated by the school headmaster”, none of which was discarded. The 3 last cases deal with observations falling into a different district then the one indicated on the schools dataset. Those amount to 6935 observations and are subdivided as such:

# Number of obs with different district names: 
spatial_join[districts_similar_harm!= 1, .N]
## [1] 6935

# Subsetting the above group
mismatches <- spatial_join[districts_similar_harm!= 1]

# Number of obs with different district names but same harmonized Ward names: 
mismatches[wards_similar_harm== 1, .N]
## [1] 1035

# Number of obs with different district names and different harmonized Ward names: 
mismatches[wards_similar_harm!= 1, .N]
## [1] 5900


# Number of obs with different district names but similar Ward names: 
mismatches[between(wards_similar, 0.7, .99), .N]
## [1] 156

# Number of obs with different district names and different Ward names: 
mismatches[between(wards_similar, 0, .69), .N]
## [1] 5756

Case 4 deals with the group of observations with different district names but similar ward names. Just as in Case 2, we use the Levenshtein distance to determine which ward mismatches are related to typos in the ward names. Hence our adjustment takes the following form:

#correcting for observations for which between(wards_similar, 0.7, .99) == TRUE
spatial_join[school_num==2703, ward_harm := "mukulika"]
spatial_join[school_num==3535, ward_harm := "mwininyilamba"]
spatial_join[school_num==3538, ward_harm := "mwininyilamba"]
spatial_join[school_num==3547, ward_harm := "kanongesha"]
spatial_join[school_num==3550, ward_harm := "kanongesha"]
spatial_join[school_num==3568, ward_harm := "kanongesha"]
spatial_join[school_num==3574, ward_harm := "kanongesha"]
spatial_join[school_num==3577, ward_harm := "kanongesha"]
spatial_join[school_num==3580, ward_harm := "mwininyilamba"]
spatial_join[school_num==3585, ward_harm := "kanongesha"]
spatial_join[school_num==5192, ward_harm := "kanongesha"]
spatial_join[school_num==7400, ward_harm := "kasengo"]

After adjusting for the above cases, we update the wards_similar_harm variable re-run the tests:

# Number of obs with different district names: 
spatial_join[districts_similar_harm!= 1, .N]
## [1] 6935

# Subsetting the above group
mismatches <- spatial_join[districts_similar_harm!= 1]

# Number of obs with different district names but same harmonized Ward names: 
mismatches[wards_similar_harm== 1, .N]
## [1] 1179

# Number of obs with different district names and different harmonized Ward names: 
mismatches[wards_similar_harm!= 1, .N]
## [1] 5756

# Number of obs with different district names but similar Ward names: 
mismatches[between(wards_similar, 0.7, .99), .N]
## [1] 156

# Number of obs with different district names and different Ward names: 
mismatches[between(wards_similar, 0, .69), .N]
## [1] 5756

Note that the number of observations with different district names but same harmonized ward names increased from 1035 to 1179, while the the number of observations with different district names and different harmonized Ward names decreased from 5900 to 5756. Therefore our sample of 6935 observations with a district mismatch is now divided in two groups whose matching procedure will be discussed in the last two cases.

Case 5: different districts, same wards

Case 5 refers to the group of observations with a district mismatch but an exact ward match.

# Number of obs with Same harmonized Ward names but with different district names: 
mismatches[wards_similar_harm== 1, .N]
## [1] 1179

# Number of obs with Same harmonized Ward names  and different but ACCEPTED harmonized district names: 
mismatches[wards_similar_harm == 1 & districts_similar_harm == 1.1 , .N]
## [1] 0

These observations amount to 1179 and require no adjustment because ward values being unique, an exact ward match with the Zambia map also indicates that the (harmonized) Zambia district within which the observation falls is correct. It remains that there is a discrepancy between the district indicated “by the school headmaster” and the harmonized district. to resolve this issue, the similarity score districts_similar_harm for those observations is updated to an arbitrary distinctive value, 1.1.

A second round of the tests results in:

# Number of obs with Same harmonized Ward names but with different district names: 
mismatches[wards_similar_harm== 1, .N]
## [1] 1179

# Number of obs with Same harmonized Ward names and different but ACCEPTED harmonized district names: 
mismatches[wards_similar_harm == 1 & districts_similar_harm == 1.1 , .N]
## [1] 1179
Case 6: different districts, different wards

After dealing with the matching decision for the 1179 out 6935 observations with different harmonized district names but same wards names, we now deal with the remaining 5756 observations with different district names and different ward names. Any matchin method for this group will be quite arbitrary as the ward and district information “indicated by the school head master” is in complete contrast with the GPS information. Therefore, No adjustment is made for this group. The following map pictures Case 6 of observations’ locations on the Zambia wards map.

Zambia Districts map

Our final round of tests are indicated below, and the matching exercise’s final output is saved as zambia_schools_matched.csv.

# Total Number of observations
spatial_join[, .N]
## [1] 77647

# Total Number of observations harmonized after solving cases 1 to 6 
spatial_join[districts_similar_harm == 1 | districts_similar_harm == 1.1  , .N]
## [1] 71891

# Total Number of observations NOT harmonized after solving cases 1 to 6 
spatial_join[districts_similar_harm != 1 & districts_similar_harm != 1.1  , .N]
## [1] 5756

4. Bridging Exercise: creating a bridge for observations with no gps informaion

This final exercise of the code consists in using the successful matches of the previous section as a bridge to map the portion of the initial schools dataset, zambia_schools_adjusted.csv with no GPS coordinates. Hence we load both the zambia_schools_matched.csv and zambia_schools_adjusted.csv (from which we derive schools_nocoord, the group of 52095 observations with no GPS coordinates), create a bridge using the successful matches by finding the more recurrent harmonization value for each harmonized ward, and extrapolate such values to the observations in the schools_nocoord with a corresponding ward.

We assess the results of this procedure by create match_status variable determining which of the observations in schools_nocoord have effectively been mapped using the bridge created from successful matches in the previous section. Hence we otain the following:

schools_nocoord[,table(match_status)] 

## match_status
##    match no_match 
##    43756     8339

The output of this bridging exercise is therefore characterized by 43756 successfully matches out of 52095 observations (roughly 84%), and is saved as harmonized_noGPS.csv.

5. Results Summary


##### 5.1 Waterfall chart illustrating matching outcomes


##### 5.2 Zambia unmatched EMIS Schools prior to and after adjustment

Rendering

rmarkdown::render(“Zambia - School electrification - School to ward matching.rmd”)